{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Streetcar Delay Prediction - Data Exploration\n",
    "\n",
    "Use dataset covering Toronto Transit Commission (TTC) streetcar delays 2014 - present to predict future delays and come up with recommendations for avoiding delays.\n",
    "\n",
    "Source dataset: : https://open.toronto.ca/dataset/ttc-streetcar-delay-data/\n",
    "\n",
    "This notebook contains the common data loading and exploration steps.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Streetcar routes\n",
    "\n",
    "From https://www.ttc.ca/Routes/Streetcars.jsp\n",
    "\n",
    "<table style=\"border: none\" align=\"left\">\n",
    "   </tr>\n",
    "   <tr style=\"border: none\">\n",
    "       <th style=\"border: none\"><img src=\"https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcar%20routes.jpg\" width=\"600\" alt=\"Icon\"> </th>\n",
    "   </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Streetcar vehicle IDs CLRV/ALRV\n",
    "\n",
    "From https://en.wikipedia.org/wiki/Toronto_streetcar_system_rolling_stock#CLRVs_and_ALRVs\n",
    "\n",
    "<table style=\"border: none\" align=\"left\">\n",
    "   </tr>\n",
    "   <tr style=\"border: none\">\n",
    "       <th style=\"border: none\"><img src=\"https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcarCLRV.jpg\" width=\"600\" alt=\"Icon\"> </th>\n",
    "   </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "valid streetcars [4000, 4001, 4002, 4003, 4004, 4005, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4022, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4030, 4031, 4032, 4033, 4034, 4035, 4036, 4037, 4038, 4039, 4040, 4041, 4042, 4043, 4044, 4045, 4046, 4047, 4048, 4049, 4050, 4051, 4052, 4053, 4054, 4055, 4056, 4057, 4058, 4059, 4060, 4061, 4062, 4063, 4064, 4065, 4066, 4067, 4068, 4069, 4070, 4071, 4072, 4073, 4074, 4075, 4076, 4077, 4078, 4079, 4080, 4081, 4082, 4083, 4084, 4085, 4086, 4087, 4088, 4089, 4090, 4091, 4092, 4093, 4094, 4095, 4096, 4097, 4098, 4099, 4100, 4101, 4102, 4103, 4104, 4105, 4106, 4107, 4108, 4109, 4110, 4111, 4112, 4113, 4114, 4115, 4116, 4117, 4118, 4119, 4120, 4121, 4122, 4123, 4124, 4125, 4126, 4127, 4128, 4129, 4130, 4131, 4132, 4133, 4134, 4135, 4136, 4137, 4138, 4139, 4140, 4141, 4142, 4143, 4144, 4145, 4146, 4147, 4148, 4149, 4150, 4151, 4152, 4153, 4154, 4155, 4156, 4157, 4158, 4159, 4160, 4161, 4162, 4163, 4164, 4165, 4166, 4167, 4168, 4169, 4170, 4171, 4172, 4173, 4174, 4175, 4176, 4177, 4178, 4179, 4180, 4181, 4182, 4183, 4184, 4185, 4186, 4187, 4188, 4189, 4190, 4191, 4192, 4193, 4194, 4195, 4196, 4197, 4198, 4199, 4200, 4201, 4202, 4203, 4204, 4205, 4206, 4207, 4208, 4209, 4210, 4211, 4212, 4213, 4214, 4215, 4216, 4217, 4218, 4219, 4220, 4221, 4222, 4223, 4224, 4225, 4226, 4227, 4228, 4229, 4230, 4231, 4232, 4233, 4234, 4235, 4236, 4237, 4238, 4239, 4240, 4241, 4242, 4243, 4244, 4245, 4246, 4247, 4248, 4249, 4250, 4251, 4900, 4400, 4402, 4403, 4404, 4405, 4406, 4407, 4408, 4409, 4410, 4411, 4412, 4413, 4414, 4415, 4416, 4417, 4418, 4419, 4420, 4421, 4422, 4423, 4424, 4425, 4426, 4427, 4428, 4429, 4430, 4431, 4432, 4433, 4434, 4435, 4436, 4437, 4438, 4439, 4440, 4441, 4442, 4443, 4444, 4445, 4446, 4447, 4448, 4449, 4450, 4451, 4452, 4453, 4454, 4455, 4456, 4457, 4458, 4459, 4460, 4461, 4462, 4463, 4464, 4465, 4466, 4467, 4468, 4469, 4470, 4471, 4472, 4473, 4474, 4475, 4476, 4477, 4478, 4479, 4480, 4481, 4482, 4483, 4484, 4485, 4486, 4487, 4488, 4489, 4490, 4491, 4492, 4493, 4494, 4495, 4496, 4497, 4498, 4499, 4500, 4501, 4502, 4503, 4504, 4505, 4506, 4507]\n"
     ]
    }
   ],
   "source": [
    "streetcar_vehicles = list(range(4000,4006))+ list(range(4010,4200)) +  list(range(4200,4252)) + [4900]\n",
    "streetcar_vehicles = streetcar_vehicles + [4400] + list(range(4402,4508))\n",
    "print(\"valid streetcars\",streetcar_vehicles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Streetcar vehicle IDs Flexity\n",
    "\n",
    "From https://en.wikipedia.org/wiki/Toronto_streetcar_system_rolling_stock#CLRVs_and_ALRVs\n",
    "\n",
    "<table style=\"border: none\" align=\"left\">\n",
    "   </tr>\n",
    "   <tr style=\"border: none\">\n",
    "       <th style=\"border: none\"><img src=\"https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcarflexity.jpg\" width=\"600\" alt=\"Icon\"> </th>\n",
    "   </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bus identification\n",
    "The following links define the valid non-streetcar vehicles that can be delayed by streetcar incidents\n",
    "\n",
    "- Buses 1xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_1000-1149\n",
    "- Buses 2xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_2000-2110,_2150-2155,_2240-2485,_2600-2619,_2700-2765,_2767-2858\n",
    "- Buses 70xx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_7000-7134\n",
    "- Buses 74xx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_7400-7499,_7500-7619,_7620-7881\n",
    "- Buses 8xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_8000-8099\n",
    "- Buses 9xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_9000-9026\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "valid vehicles [4000, 4001, 4002, 4003, 4004, 4005, 4010, 4011, 4012, 4013, 4014, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4022, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4030, 4031, 4032, 4033, 4034, 4035, 4036, 4037, 4038, 4039, 4040, 4041, 4042, 4043, 4044, 4045, 4046, 4047, 4048, 4049, 4050, 4051, 4052, 4053, 4054, 4055, 4056, 4057, 4058, 4059, 4060, 4061, 4062, 4063, 4064, 4065, 4066, 4067, 4068, 4069, 4070, 4071, 4072, 4073, 4074, 4075, 4076, 4077, 4078, 4079, 4080, 4081, 4082, 4083, 4084, 4085, 4086, 4087, 4088, 4089, 4090, 4091, 4092, 4093, 4094, 4095, 4096, 4097, 4098, 4099, 4100, 4101, 4102, 4103, 4104, 4105, 4106, 4107, 4108, 4109, 4110, 4111, 4112, 4113, 4114, 4115, 4116, 4117, 4118, 4119, 4120, 4121, 4122, 4123, 4124, 4125, 4126, 4127, 4128, 4129, 4130, 4131, 4132, 4133, 4134, 4135, 4136, 4137, 4138, 4139, 4140, 4141, 4142, 4143, 4144, 4145, 4146, 4147, 4148, 4149, 4150, 4151, 4152, 4153, 4154, 4155, 4156, 4157, 4158, 4159, 4160, 4161, 4162, 4163, 4164, 4165, 4166, 4167, 4168, 4169, 4170, 4171, 4172, 4173, 4174, 4175, 4176, 4177, 4178, 4179, 4180, 4181, 4182, 4183, 4184, 4185, 4186, 4187, 4188, 4189, 4190, 4191, 4192, 4193, 4194, 4195, 4196, 4197, 4198, 4199, 4200, 4201, 4202, 4203, 4204, 4205, 4206, 4207, 4208, 4209, 4210, 4211, 4212, 4213, 4214, 4215, 4216, 4217, 4218, 4219, 4220, 4221, 4222, 4223, 4224, 4225, 4226, 4227, 4228, 4229, 4230, 4231, 4232, 4233, 4234, 4235, 4236, 4237, 4238, 4239, 4240, 4241, 4242, 4243, 4244, 4245, 4246, 4247, 4248, 4249, 4250, 4251, 4900, 4400, 4402, 4403, 4404, 4405, 4406, 4407, 4408, 4409, 4410, 4411, 4412, 4413, 4414, 4415, 4416, 4417, 4418, 4419, 4420, 4421, 4422, 4423, 4424, 4425, 4426, 4427, 4428, 4429, 4430, 4431, 4432, 4433, 4434, 4435, 4436, 4437, 4438, 4439, 4440, 4441, 4442, 4443, 4444, 4445, 4446, 4447, 4448, 4449, 4450, 4451, 4452, 4453, 4454, 4455, 4456, 4457, 4458, 4459, 4460, 4461, 4462, 4463, 4464, 4465, 4466, 4467, 4468, 4469, 4470, 4471, 4472, 4473, 4474, 4475, 4476, 4477, 4478, 4479, 4480, 4481, 4482, 4483, 4484, 4485, 4486, 4487, 4488, 4489, 4490, 4491, 4492, 4493, 4494, 4495, 4496, 4497, 4498, 4499, 4500, 4501, 4502, 4503, 4504, 4505, 4506, 4507, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030, 2031, 2032, 2033, 2034, 2035, 2036, 2037, 2038, 2039, 2040, 2041, 2042, 2043, 2044, 2045, 2046, 2047, 2048, 2049, 2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057, 2058, 2059, 2060, 2061, 2062, 2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071, 2072, 2073, 2074, 2075, 2076, 2077, 2078, 2079, 2080, 2081, 2082, 2083, 2084, 2085, 2086, 2087, 2088, 2089, 2090, 2091, 2092, 2093, 2094, 2095, 2096, 2097, 2098, 2099, 2100, 2101, 2102, 2103, 2104, 2105, 2106, 2107, 2108, 2109, 2110, 2150, 2151, 2152, 2153, 2154, 2155, 2240, 2241, 2242, 2243, 2244, 2245, 2246, 2247, 2248, 2249, 2250, 2251, 2252, 2253, 2254, 2255, 2256, 2257, 2258, 2259, 2260, 2261, 2262, 2263, 2264, 2265, 2266, 2267, 2268, 2269, 2270, 2271, 2272, 2273, 2274, 2275, 2276, 2277, 2278, 2279, 2280, 2281, 2282, 2283, 2284, 2285, 2286, 2287, 2288, 2289, 2290, 2291, 2292, 2293, 2294, 2295, 2296, 2297, 2298, 2299, 2300, 2301, 2302, 2303, 2304, 2305, 2306, 2307, 2308, 2309, 2310, 2311, 2312, 2313, 2314, 2315, 2316, 2317, 2318, 2319, 2320, 2321, 2322, 2323, 2324, 2325, 2326, 2327, 2328, 2329, 2330, 2331, 2332, 2333, 2334, 2335, 2336, 2337, 2338, 2339, 2340, 2341, 2342, 2343, 2344, 2345, 2346, 2347, 2348, 2349, 2350, 2351, 2352, 2353, 2354, 2355, 2356, 2357, 2358, 2359, 2360, 2361, 2362, 2363, 2364, 2365, 2366, 2367, 2368, 2369, 2370, 2371, 2372, 2373, 2374, 2375, 2376, 2377, 2378, 2379, 2380, 2381, 2382, 2383, 2384, 2385, 2386, 2387, 2388, 2389, 2390, 2391, 2392, 2393, 2394, 2395, 2396, 2397, 2398, 2399, 2400, 2401, 2402, 2403, 2404, 2405, 2406, 2407, 2408, 2409, 2410, 2411, 2412, 2413, 2414, 2415, 2416, 2417, 2418, 2419, 2420, 2421, 2422, 2423, 2424, 2425, 2426, 2427, 2428, 2429, 2430, 2431, 2432, 2433, 2434, 2435, 2436, 2437, 2438, 2439, 2440, 2441, 2442, 2443, 2444, 2445, 2446, 2447, 2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458, 2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469, 2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480, 2481, 2482, 2483, 2484, 2485, 2600, 2601, 2602, 2603, 2604, 2605, 2606, 2607, 2608, 2609, 2610, 2611, 2612, 2613, 2614, 2615, 2616, 2617, 2618, 2619, 2700, 2701, 2702, 2703, 2704, 2705, 2706, 2707, 2708, 2709, 2710, 2711, 2712, 2713, 2714, 2715, 2716, 2717, 2718, 2719, 2720, 2721, 2722, 2723, 2724, 2725, 2726, 2727, 2728, 2729, 2730, 2731, 2732, 2733, 2734, 2735, 2736, 2737, 2738, 2739, 2740, 2741, 2742, 2743, 2744, 2745, 2746, 2747, 2748, 2749, 2750, 2751, 2752, 2753, 2754, 2755, 2756, 2757, 2758, 2759, 2760, 2761, 2762, 2763, 2764, 2765, 2767, 2768, 2769, 2770, 2771, 2772, 2773, 2774, 2775, 2776, 2777, 2778, 2779, 2780, 2781, 2782, 2783, 2784, 2785, 2786, 2787, 2788, 2789, 2790, 2791, 2792, 2793, 2794, 2795, 2796, 2797, 2798, 2799, 2800, 2801, 2802, 2803, 2804, 2805, 2806, 2807, 2808, 2809, 2810, 2811, 2812, 2813, 2814, 2815, 2816, 2817, 2818, 2819, 2820, 2821, 2822, 2823, 2824, 2825, 2826, 2827, 2828, 2829, 2830, 2831, 2832, 2833, 2834, 2835, 2836, 2837, 2838, 2839, 2840, 2841, 2842, 2843, 2844, 2845, 2846, 2847, 2848, 2849, 2850, 2851, 2852, 2853, 2854, 2855, 2856, 2857, 2858, 7000, 7001, 7002, 7003, 7004, 7005, 7006, 7007, 7008, 7009, 7010, 7011, 7012, 7013, 7014, 7015, 7016, 7017, 7018, 7019, 7020, 7021, 7022, 7023, 7024, 7025, 7026, 7027, 7028, 7029, 7030, 7031, 7032, 7033, 7034, 7035, 7036, 7037, 7038, 7039, 7040, 7041, 7042, 7043, 7044, 7045, 7046, 7047, 7048, 7049, 7050, 7051, 7052, 7053, 7054, 7055, 7056, 7057, 7058, 7059, 7060, 7061, 7062, 7063, 7064, 7065, 7066, 7067, 7068, 7069, 7070, 7071, 7072, 7073, 7074, 7075, 7076, 7077, 7078, 7079, 7080, 7081, 7082, 7083, 7084, 7085, 7086, 7087, 7088, 7089, 7090, 7091, 7092, 7093, 7094, 7095, 7096, 7097, 7098, 7099, 7100, 7101, 7102, 7103, 7104, 7105, 7106, 7107, 7108, 7109, 7110, 7111, 7112, 7113, 7114, 7115, 7116, 7117, 7118, 7119, 7120, 7121, 7122, 7123, 7124, 7125, 7126, 7127, 7128, 7129, 7130, 7131, 7132, 7133, 7134, 7400, 7401, 7402, 7403, 7404, 7405, 7406, 7407, 7408, 7409, 7410, 7411, 7412, 7413, 7414, 7415, 7416, 7417, 7418, 7419, 7420, 7421, 7422, 7423, 7424, 7425, 7426, 7427, 7428, 7429, 7430, 7431, 7432, 7433, 7434, 7435, 7436, 7437, 7438, 7439, 7440, 7441, 7442, 7443, 7444, 7445, 7446, 7447, 7448, 7449, 7500, 7501, 7502, 7503, 7504, 7505, 7506, 7507, 7508, 7509, 7510, 7511, 7512, 7513, 7514, 7515, 7516, 7517, 7518, 7519, 7520, 7521, 7522, 7523, 7524, 7525, 7526, 7527, 7528, 7529, 7530, 7531, 7532, 7533, 7534, 7535, 7536, 7537, 7538, 7539, 7540, 7541, 7542, 7543, 7544, 7545, 7546, 7547, 7548, 7549, 7550, 7551, 7552, 7553, 7554, 7555, 7556, 7557, 7558, 7559, 7560, 7561, 7562, 7563, 7564, 7565, 7566, 7567, 7568, 7569, 7570, 7571, 7572, 7573, 7574, 7575, 7576, 7577, 7578, 7579, 7580, 7581, 7582, 7583, 7584, 7585, 7586, 7587, 7588, 7589, 7590, 7591, 7592, 7593, 7594, 7595, 7596, 7597, 7598, 7599, 7600, 7601, 7602, 7603, 7604, 7605, 7606, 7607, 7608, 7609, 7610, 7611, 7612, 7613, 7614, 7615, 7616, 7617, 7618, 7619, 7620, 7621, 7622, 7623, 7624, 7625, 7626, 7627, 7628, 7629, 7630, 7631, 7632, 7633, 7634, 7635, 7636, 7637, 7638, 7639, 7640, 7641, 7642, 7643, 7644, 7645, 7646, 7647, 7648, 7649, 7650, 7651, 7652, 7653, 7654, 7655, 7656, 7657, 7658, 7659, 7660, 7661, 7662, 7663, 7664, 7665, 7666, 7667, 7668, 7669, 7670, 7671, 7672, 7673, 7674, 7675, 7676, 7677, 7678, 7679, 7680, 7681, 7682, 7683, 7684, 7685, 7686, 7687, 7688, 7689, 7690, 7691, 7692, 7693, 7694, 7695, 7696, 7697, 7698, 7699, 7700, 7701, 7702, 7703, 7704, 7705, 7706, 7707, 7708, 7709, 7710, 7711, 7712, 7713, 7714, 7715, 7716, 7717, 7718, 7719, 7720, 7721, 7722, 7723, 7724, 7725, 7726, 7727, 7728, 7729, 7730, 7731, 7732, 7733, 7734, 7735, 7736, 7737, 7738, 7739, 7740, 7741, 7742, 7743, 7744, 7745, 7746, 7747, 7748, 7749, 7750, 7751, 7752, 7753, 7754, 7755, 7756, 7757, 7758, 7759, 7760, 7761, 7762, 7763, 7764, 7765, 7766, 7767, 7768, 7769, 7770, 7771, 7772, 7773, 7774, 7775, 7776, 7777, 7778, 7779, 7780, 7781, 7782, 7783, 7784, 7785, 7786, 7787, 7788, 7789, 7790, 7791, 7792, 7793, 7794, 7795, 7796, 7797, 7798, 7799, 7800, 7801, 7802, 7803, 7804, 7805, 7806, 7807, 7808, 7809, 7810, 7811, 7812, 7813, 7814, 7815, 7816, 7817, 7818, 7819, 7820, 7821, 7822, 7823, 7824, 7825, 7826, 7827, 7828, 7829, 7830, 7831, 7832, 7833, 7834, 7835, 7836, 7837, 7838, 7839, 7840, 7841, 7842, 7843, 7844, 7845, 7846, 7847, 7848, 7849, 7850, 7851, 7852, 7853, 7854, 7855, 7856, 7857, 7858, 7859, 7860, 7861, 7862, 7863, 7864, 7865, 7866, 7867, 7868, 7869, 7870, 7871, 7872, 7873, 7874, 7875, 7876, 7877, 7878, 7879, 7880, 7881, 8000, 8001, 8002, 8003, 8004, 8005, 8006, 8007, 8008, 8009, 8010, 8011, 8012, 8013, 8014, 8015, 8016, 8017, 8018, 8019, 8020, 8021, 8022, 8023, 8024, 8025, 8026, 8027, 8028, 8029, 8030, 8031, 8032, 8033, 8034, 8035, 8036, 8037, 8038, 8039, 8040, 8041, 8042, 8043, 8044, 8045, 8046, 8047, 8048, 8049, 8050, 8051, 8052, 8053, 8054, 8055, 8056, 8057, 8058, 8059, 8060, 8061, 8062, 8063, 8064, 8065, 8066, 8067, 8068, 8069, 8070, 8071, 8072, 8073, 8074, 8075, 8076, 8077, 8078, 8079, 8080, 8081, 8082, 8083, 8084, 8085, 8086, 8087, 8088, 8089, 8090, 8091, 8092, 8093, 8094, 8095, 8096, 8097, 8098, 8099, 9000, 9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008, 9009, 9010, 9011, 9012, 9013, 9014, 9015, 9016, 9017, 9018, 9019, 9020, 9021, 9022, 9023, 9024, 9025, 9026]\n"
     ]
    }
   ],
   "source": [
    "bus_vehicles = list(range(1000,1150))+ list(range(2000,2111)) + list(range(2150,2156)) + list(range(2240,2486))\n",
    "bus_vehicles = bus_vehicles + list(range(2600,2620)) + list(range(2700,2766)) + list(range(2767,2859))\n",
    "bus_vehicles = bus_vehicles + list(range(7000,7135)) + list(range(7400,7450)) + list(range(7500,7620)) + list(range(7620,7882))\n",
    "bus_vehicles = bus_vehicles + list(range(8000,8100)) + list(range(9000,9027))\n",
    "valid_vehicles = streetcar_vehicles + bus_vehicles\n",
    "print(\"valid vehicles\",valid_vehicles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Vehicles that are not legitimate subjects of streetcar incidents\n",
    "The following vehicles are not legitimate subjects of streetcar incidents because they run on completely separate tracks (RT and subway) or they have been retired (6xxx buses).\n",
    "\n",
    "- RT cars 3xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_3000-3027\n",
    "- Subway cars 5xxx https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_5000-5371\n",
    "- Retired Buses 6xxx: https://cptdb.ca/wiki/index.php/Toronto_Transit_Commission_6000-6122\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the valid list of TTC Streetcar routes\n",
    "valid_routes = ['501','502','503','504','505','506','509','510','511','512','301','304','306','310']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['501',\n",
       " '502',\n",
       " '503',\n",
       " '504',\n",
       " '505',\n",
       " '506',\n",
       " '509',\n",
       " '510',\n",
       " '511',\n",
       " '512',\n",
       " '301',\n",
       " '304',\n",
       " '306',\n",
       " '310']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "valid_routes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# original valid directions\n",
    "# valid_directions = ['E/B','W/B','N/B','S/B','B/W']\n",
    "# revised valid directions to include lowercasing and removal of '/' and simplify to single letter\n",
    "valid_directions = ['e','w','n','s','b']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "valid_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
    "import matplotlib.pyplot as plt\n",
    "# import seaborn as sns\n",
    "import datetime\n",
    "import os\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
    "import matplotlib.pyplot as plt\n",
    "import datetime as dt\n",
    "# common imports\n",
    "import zipfile\n",
    "import time\n",
    "# import datetime, timedelta\n",
    "import datetime\n",
    "from datetime import datetime, timedelta\n",
    "from datetime import date\n",
    "from dateutil import relativedelta\n",
    "from io import StringIO\n",
    "import pandas as pd\n",
    "import pickle\n",
    "from sklearn.base import BaseEstimator\n",
    "from sklearn.base import TransformerMixin\n",
    "from io import StringIO\n",
    "import requests\n",
    "import json\n",
    "from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler\n",
    "from sklearn.model_selection import train_test_split\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline \n",
    "import os\n",
    "import math\n",
    "from subprocess import check_output\n",
    "from IPython.display import display\n",
    "import logging\n",
    "import yaml\n",
    "from collections import Counter\n",
    "import re\n",
    "import os\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load and Save Data\n",
    "- parse list of XLS files \n",
    "- load XLS files, tab by tab, into dataframe\n",
    "- pickle dataframe for future runs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "current directory is: C:\\personal\\manning\\deep_learning_for_structured_data\\notebooks\n",
      "path_to_yaml C:\\personal\\manning\\deep_learning_for_structured_data\\notebooks\\streetcar_data_preparation_config.yml\n"
     ]
    }
   ],
   "source": [
    "# load config file\n",
    "current_path = os.getcwd()\n",
    "print(\"current directory is: \"+current_path)\n",
    "\n",
    "path_to_yaml = os.path.join(current_path, 'streetcar_data_preparation_config.yml')\n",
    "print(\"path_to_yaml \"+path_to_yaml)\n",
    "try:\n",
    "    with open (path_to_yaml, 'r') as c_file:\n",
    "        config = yaml.safe_load(c_file)\n",
    "except Exception as e:\n",
    "    print('Error reading the config file')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# common variables\n",
    "# control whether to load data from scratch from original source or from saved dataframe\n",
    "load_from_scratch = config['general']['load_from_scratch']\n",
    "# control whether to save dataframe with transformed data\n",
    "save_transformed_dataframe = config['general']['save_transformed_dataframe']\n",
    "# control whether rows containing erroneous values are removed from the saved dataset\n",
    "remove_bad_values = config['general']['remove_bad_values']\n",
    "# name of file containing pickled dataframe version of input (unprocessed) dataset\n",
    "pickled_input_dataframe = config['file_names']['pickled_input_dataframe']\n",
    "# name of file to which prepared data set is saved as a pickled dataframe\n",
    "pickled_output_dataframe = config['file_names']['pickled_output_dataframe']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "load_from_scratch False\n",
      "save_transformed_dataframe True\n",
      "remove_bad_values True\n",
      "pickled_input_dataframe 2014_2019.pkl\n",
      "pickled_output_dataframe 2014_2019_df_cleaned_remove_bad_values_may16_2020.pkl\n"
     ]
    }
   ],
   "source": [
    "print(\"load_from_scratch \"+str(load_from_scratch))\n",
    "print(\"save_transformed_dataframe \"+str(save_transformed_dataframe))\n",
    "print(\"remove_bad_values \"+str(remove_bad_values))\n",
    "print(\"pickled_input_dataframe \"+str(pickled_input_dataframe))\n",
    "print(\"pickled_output_dataframe \"+str(pickled_output_dataframe))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "raw path is C:\\personal\\manning\\deep_learning_for_structured_data\\notebooks\n"
     ]
    }
   ],
   "source": [
    "# get the directory for that this notebook is in\n",
    "rawpath = os.getcwd()\n",
    "print(\"raw path is\",rawpath)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "path is C:\\personal\\manning\\deep_learning_for_structured_data\\data/\n"
     ]
    }
   ],
   "source": [
    "# data is in a directory called \"data\" that is a sibling to the directory containing the notebook\n",
    "# this code assumes you have copied to this directory all the XLS files from the source dataset: https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#e8f359f0-2f47-3058-bf64-6ec488de52da\n",
    "path = os.path.abspath(os.path.join(rawpath, '..', 'data')) + \"/\"\n",
    "print(\"path is\", path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define categories for input columns\n",
    "def define_feature_categories(df):\n",
    "    allcols = list(df)\n",
    "    print(\"all cols\",allcols)\n",
    "    textcols = ['Incident','Location'] # \n",
    "    continuouscols = ['Min Delay','Min Gap'] \n",
    "                      # columns to deal with as continuous values - no embeddings\n",
    "    timecols = ['Report Date','Time']\n",
    "    collist = ['Day','Vehicle','Route','Direction']\n",
    "    for col in continuouscols:\n",
    "        df[col] = df[col].astype(float)\n",
    "    print('texcols: ',textcols)\n",
    "    print('continuouscols: ',continuouscols)\n",
    "    print('timecols: ',timecols)\n",
    "    print('collist: ',collist)\n",
    "    return(allcols,textcols,continuouscols,timecols,collist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load dataframe\n",
    "- load pickled dataframe\n",
    "- show info about the dataset\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Day</th>\n",
       "      <th>Delay</th>\n",
       "      <th>Direction</th>\n",
       "      <th>Gap</th>\n",
       "      <th>Incident</th>\n",
       "      <th>Incident ID</th>\n",
       "      <th>Location</th>\n",
       "      <th>Min Delay</th>\n",
       "      <th>Min Gap</th>\n",
       "      <th>Report Date</th>\n",
       "      <th>Route</th>\n",
       "      <th>Time</th>\n",
       "      <th>Vehicle</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>E/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Late Leaving Garage</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Dundas and Roncesvalles</td>\n",
       "      <td>4.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>2014-01-02</td>\n",
       "      <td>505</td>\n",
       "      <td>06:31:00</td>\n",
       "      <td>4018.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>E/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Utilized Off Route</td>\n",
       "      <td>NaN</td>\n",
       "      <td>King and Shaw</td>\n",
       "      <td>20.0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>2014-01-02</td>\n",
       "      <td>504</td>\n",
       "      <td>12:43:00</td>\n",
       "      <td>4128.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>W/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Held By</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Kingston road and Bingham</td>\n",
       "      <td>13.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>2014-01-02</td>\n",
       "      <td>501</td>\n",
       "      <td>14:01:00</td>\n",
       "      <td>4016.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>W/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Investigation</td>\n",
       "      <td>NaN</td>\n",
       "      <td>King St. and Roncesvalles Ave.</td>\n",
       "      <td>7.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2014-01-02</td>\n",
       "      <td>504</td>\n",
       "      <td>14:22:00</td>\n",
       "      <td>4175.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>E/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Utilized Off Route</td>\n",
       "      <td>NaN</td>\n",
       "      <td>King and Bathurst</td>\n",
       "      <td>3.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>2014-01-02</td>\n",
       "      <td>504</td>\n",
       "      <td>16:42:00</td>\n",
       "      <td>4080.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Day  Delay Direction  Gap             Incident  Incident ID  \\\n",
       "0  Thursday    NaN       E/B  NaN  Late Leaving Garage          NaN   \n",
       "1  Thursday    NaN       E/B  NaN   Utilized Off Route          NaN   \n",
       "2  Thursday    NaN       W/B  NaN              Held By          NaN   \n",
       "3  Thursday    NaN       W/B  NaN        Investigation          NaN   \n",
       "4  Thursday    NaN       E/B  NaN   Utilized Off Route          NaN   \n",
       "\n",
       "                         Location  Min Delay  Min Gap Report Date  Route  \\\n",
       "0         Dundas and Roncesvalles        4.0      8.0  2014-01-02    505   \n",
       "1                   King and Shaw       20.0     22.0  2014-01-02    504   \n",
       "2       Kingston road and Bingham       13.0     19.0  2014-01-02    501   \n",
       "3  King St. and Roncesvalles Ave.        7.0     11.0  2014-01-02    504   \n",
       "4               King and Bathurst        3.0      6.0  2014-01-02    504   \n",
       "\n",
       "       Time  Vehicle  \n",
       "0  06:31:00   4018.0  \n",
       "1  12:43:00   4128.0  \n",
       "2  14:01:00   4016.0  \n",
       "3  14:22:00   4175.0  \n",
       "4  16:42:00   4080.0  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read in previously pickled dataframe containing data from s/s 2014 - 2019\n",
    "df = pd.read_pickle(os.path.join(path,pickled_input_dataframe))\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# General explorations\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Delay</th>\n",
       "      <th>Gap</th>\n",
       "      <th>Incident ID</th>\n",
       "      <th>Min Delay</th>\n",
       "      <th>Min Gap</th>\n",
       "      <th>Route</th>\n",
       "      <th>Vehicle</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>count</td>\n",
       "      <td>3444.000000</td>\n",
       "      <td>3434.000000</td>\n",
       "      <td>889.000000</td>\n",
       "      <td>75004.000000</td>\n",
       "      <td>74975.000000</td>\n",
       "      <td>78525.000000</td>\n",
       "      <td>73890.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>mean</td>\n",
       "      <td>14.503194</td>\n",
       "      <td>20.133663</td>\n",
       "      <td>3.901012</td>\n",
       "      <td>12.817570</td>\n",
       "      <td>18.304915</td>\n",
       "      <td>500.840688</td>\n",
       "      <td>4405.327920</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>std</td>\n",
       "      <td>38.477453</td>\n",
       "      <td>37.362669</td>\n",
       "      <td>2.861142</td>\n",
       "      <td>30.083595</td>\n",
       "      <td>33.678868</td>\n",
       "      <td>45.366162</td>\n",
       "      <td>1570.424282</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>min</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>25%</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>501.000000</td>\n",
       "      <td>4077.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>50%</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>14.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>505.000000</td>\n",
       "      <td>4170.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>75%</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>509.000000</td>\n",
       "      <td>4412.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>max</td>\n",
       "      <td>996.000000</td>\n",
       "      <td>999.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>1400.000000</td>\n",
       "      <td>4216.000000</td>\n",
       "      <td>999.000000</td>\n",
       "      <td>163242.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Delay          Gap  Incident ID     Min Delay       Min Gap  \\\n",
       "count  3444.000000  3434.000000   889.000000  75004.000000  74975.000000   \n",
       "mean     14.503194    20.133663     3.901012     12.817570     18.304915   \n",
       "std      38.477453    37.362669     2.861142     30.083595     33.678868   \n",
       "min       0.000000     0.000000     1.000000      0.000000      0.000000   \n",
       "25%       5.000000    10.000000     1.000000      5.000000      9.000000   \n",
       "50%       7.000000    14.000000     5.000000      6.000000     12.000000   \n",
       "75%      11.000000    20.000000     7.000000     12.000000     20.000000   \n",
       "max     996.000000   999.000000    10.000000   1400.000000   4216.000000   \n",
       "\n",
       "              Route        Vehicle  \n",
       "count  78525.000000   73890.000000  \n",
       "mean     500.840688    4405.327920  \n",
       "std       45.366162    1570.424282  \n",
       "min        1.000000       0.000000  \n",
       "25%      501.000000    4077.000000  \n",
       "50%      505.000000    4170.000000  \n",
       "75%      509.000000    4412.000000  \n",
       "max      999.000000  163242.000000  "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get various summary statistics, excluding NaN values\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Day</th>\n",
       "      <th>Delay</th>\n",
       "      <th>Direction</th>\n",
       "      <th>Gap</th>\n",
       "      <th>Incident</th>\n",
       "      <th>Incident ID</th>\n",
       "      <th>Location</th>\n",
       "      <th>Min Delay</th>\n",
       "      <th>Min Gap</th>\n",
       "      <th>Report Date</th>\n",
       "      <th>Route</th>\n",
       "      <th>Time</th>\n",
       "      <th>Vehicle</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>536</td>\n",
       "      <td>Sunday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>N/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Late Leaving Garage</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Broadview and Dundas</td>\n",
       "      <td>6.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>2017-02-12</td>\n",
       "      <td>505</td>\n",
       "      <td>11:48:00</td>\n",
       "      <td>4130.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>106</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>W/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Late Leaving Garage</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Broadview Station</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2014-09-04</td>\n",
       "      <td>504</td>\n",
       "      <td>07:32:00</td>\n",
       "      <td>4188.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>314</td>\n",
       "      <td>Wednesday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>E/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mechanical</td>\n",
       "      <td>NaN</td>\n",
       "      <td>St. Clair and Gunns Loop</td>\n",
       "      <td>3.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>2019-02-06</td>\n",
       "      <td>701</td>\n",
       "      <td>14:46:00</td>\n",
       "      <td>1226.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>167</td>\n",
       "      <td>Monday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>E/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mechanical</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Queen &amp; Greenwood</td>\n",
       "      <td>5.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>2019-03-04</td>\n",
       "      <td>501</td>\n",
       "      <td>07:41:00</td>\n",
       "      <td>4103.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>832</td>\n",
       "      <td>Saturday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>W/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mechanical</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Roncesvalles and Queen</td>\n",
       "      <td>7.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>2015-02-14</td>\n",
       "      <td>501</td>\n",
       "      <td>22:27:00</td>\n",
       "      <td>4249.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>934</td>\n",
       "      <td>Tuesday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>W/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Investigation</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bay and Dundas</td>\n",
       "      <td>4.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>2019-01-22</td>\n",
       "      <td>505</td>\n",
       "      <td>14:31:00</td>\n",
       "      <td>8132.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>806</td>\n",
       "      <td>Monday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>E/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mechanical</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Long Branch Loop</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2014-04-21</td>\n",
       "      <td>501</td>\n",
       "      <td>05:11:00</td>\n",
       "      <td>8065.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1113</td>\n",
       "      <td>Saturday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>N/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Held By</td>\n",
       "      <td>NaN</td>\n",
       "      <td>lakeshore/spadina</td>\n",
       "      <td>13.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>2016-04-30</td>\n",
       "      <td>510</td>\n",
       "      <td>21:50:00</td>\n",
       "      <td>4415.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>616</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>B/W</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Held By</td>\n",
       "      <td>NaN</td>\n",
       "      <td>dovercourt/dundas</td>\n",
       "      <td>23.0</td>\n",
       "      <td>28.0</td>\n",
       "      <td>2016-07-21</td>\n",
       "      <td>505</td>\n",
       "      <td>11:22:00</td>\n",
       "      <td>4140.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>446</td>\n",
       "      <td>Friday</td>\n",
       "      <td>NaN</td>\n",
       "      <td>W/B</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Mechanical</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Queen and Leslie</td>\n",
       "      <td>5.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2016-07-15</td>\n",
       "      <td>501</td>\n",
       "      <td>14:30:00</td>\n",
       "      <td>4088.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Day  Delay Direction  Gap             Incident  Incident ID  \\\n",
       "536      Sunday    NaN       N/B  NaN  Late Leaving Garage          NaN   \n",
       "106    Thursday    NaN       W/B  NaN  Late Leaving Garage          NaN   \n",
       "314   Wednesday    NaN       E/B  NaN           Mechanical          NaN   \n",
       "167      Monday    NaN       E/B  NaN           Mechanical          NaN   \n",
       "832    Saturday    NaN       W/B  NaN           Mechanical          NaN   \n",
       "934     Tuesday    NaN       W/B  NaN        Investigation          NaN   \n",
       "806      Monday    NaN       E/B  NaN           Mechanical          NaN   \n",
       "1113   Saturday    NaN       N/B  NaN              Held By          NaN   \n",
       "616    Thursday    NaN       B/W  NaN              Held By          NaN   \n",
       "446      Friday    NaN       W/B  NaN           Mechanical          NaN   \n",
       "\n",
       "                      Location  Min Delay  Min Gap Report Date  Route  \\\n",
       "536       Broadview and Dundas        6.0     12.0  2017-02-12    505   \n",
       "106          Broadview Station        2.0      4.0  2014-09-04    504   \n",
       "314   St. Clair and Gunns Loop        3.0      6.0  2019-02-06    701   \n",
       "167          Queen & Greenwood        5.0      9.0  2019-03-04    501   \n",
       "832     Roncesvalles and Queen        7.0     14.0  2015-02-14    501   \n",
       "934             Bay and Dundas        4.0      8.0  2019-01-22    505   \n",
       "806           Long Branch Loop       12.0      0.0  2014-04-21    501   \n",
       "1113         lakeshore/spadina       13.0     20.0  2016-04-30    510   \n",
       "616          dovercourt/dundas       23.0     28.0  2016-07-21    505   \n",
       "446           Queen and Leslie        5.0     11.0  2016-07-15    501   \n",
       "\n",
       "          Time  Vehicle  \n",
       "536   11:48:00   4130.0  \n",
       "106   07:32:00   4188.0  \n",
       "314   14:46:00   1226.0  \n",
       "167   07:41:00   4103.0  \n",
       "832   22:27:00   4249.0  \n",
       "934   14:31:00   8132.0  \n",
       "806   05:11:00   8065.0  \n",
       "1113  21:50:00   4415.0  \n",
       "616   11:22:00   4140.0  \n",
       "446   14:30:00   4088.0  "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get a sample of the raw data\n",
    "df.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Mechanical               37873\n",
       "Investigation            10950\n",
       "Held By                   9148\n",
       "Late Leaving Garage       7749\n",
       "General Delay             5766\n",
       "Utilized Off Route        2961\n",
       "Emergency Services        2687\n",
       "Diversion                 1191\n",
       "Overhead - Pantograph      200\n",
       "Name: Incident, dtype: int64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['Incident'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "incident count 9\n"
     ]
    }
   ],
   "source": [
    "print(\"incident count\",df['Incident'].nunique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2147, 13)"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the count of records where Min Delay is greater than Min Gap\n",
    "df[df['Min Gap'] < df['Min Delay']].shape"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
